{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hackathon 8\n",
    "\n",
    "Written by Eleanor Quint\n",
    "\n",
    "Topics:\n",
    "- Eager Execution\n",
    "- OpenAI Gym\n",
    "- Deep Q Networks\n",
    "    - Replay memory\n",
    "    - Policy and target networks\n",
    "    - Gradient clipping\n",
    "\n",
    "This is all setup in a IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add some to run your own code.\n",
    "\n",
    "Thanks to [OpenAI Baselines](https://github.com/openai/baselines) from which I've learned a lot about RL code. I reccomend starting there if you're doing any RL project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    }
   ],
   "source": [
    "# We'll start with our library imports...\n",
    "from __future__ import print_function\n",
    "\n",
    "import collections\n",
    "import math\n",
    "import os\n",
    "import random\n",
    "\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "import atari_wrappers           # from OpenAI Baselines\n",
    "import gym                      # for the RL environments\n",
    "import matplotlib.pyplot as plt # for plots\n",
    "\n",
    "tf.enable_eager_execution()\n",
    "print(tf.executing_eagerly())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might notice that we're loading many more libraries than usual. This is because, even for a minimal implementation of DQN and the RL learning problem, it's helpful to import a lot of functionality.\n",
    "\n",
    "Further, this hackathon will be using TensorFlow's [eager execution mode](https://www.tensorflow.org/guide/eager). This means that, rather than separating graph construction and execution, they are implicitly mixed. Eager mode does not use a Session, nor does it explicitly initialize Variables. Instead, Variables are treated like Python objects and TF function calls return their computed output immediately. This allows us to mix Python control flow with TensorFlow operations.\n",
    "\n",
    "The first thing we'll do, rather than loading data, is to create a learning environment. We're using some functions from `atari_wrappers.py` for this (because OpenAI writes really good code). This creates a [Gym environment](https://gym.openai.com/docs/) which provides a standard API for us to use. We could use one of a long list of [environments provided by Gym](https://gym.openai.com/envs/), but we'll start with the game that started it all, [Breakout](https://www.youtube.com/watch?v=TmPfTpjtdgg)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(84, 84, 4)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "env = atari_wrappers.wrap_deepmind(atari_wrappers.make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)\n",
    "NUM_ACTIONS = env.action_space.n\n",
    "OBS_SHAPE = env.observation_space.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Environments have two main functions that we'll use: `env.reset()` and `env.step()`. \n",
    "\n",
    "`reset()` is used to start an episode and returns an initial observation\n",
    "```\n",
    "observation = env.reset()\n",
    "```\n",
    "\n",
    "`step()` is used to advance the environment by passing an action to be taken by the agent\n",
    "```\n",
    "observation, reward, done, info = env.step(action)\n",
    "```\n",
    "Each step returns a tuple of four values: the observation of the environment state (in our case, an image) that results from taking the action, the reward derived from taking the action at the previous state (usually a float), a boolean which indicates whether the episode has finished, and a dict with any extra information. We can get more specific information about the actions and observations with `env.action_space` and `env.observation_space`.\n",
    "\n",
    "This is an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.\n",
    "\n",
    "Next, we'll setup the replay memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Transition = collections.namedtuple('Transition',\n",
    "                        ('state', 'action', 'next_state', 'reward'))\n",
    "\n",
    "\n",
    "class ReplayMemory(object):\n",
    "\n",
    "    def __init__(self, capacity):\n",
    "        self.capacity = capacity\n",
    "        self.memory = []\n",
    "        self.position = 0\n",
    "\n",
    "    def push(self, *args):\n",
    "        \"\"\"Saves a transition.\"\"\"\n",
    "        if len(self.memory) < self.capacity:\n",
    "            self.memory.append(None)\n",
    "        self.memory[self.position] = Transition(*args)\n",
    "        self.position = (self.position + 1) % self.capacity\n",
    "\n",
    "    def sample(self, batch_size):\n",
    "        return random.sample(self.memory, batch_size)\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.memory)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The replay memory just stores $(s,a,r,s')$ transitions for us so we can sample them out of order. This is very important for DQN because, without de-correlating transitions by sampling them out of order, DQN's q-value predictions will almost always diverge. This also allows us to sample large training batches in the online RL setting (i.e., when we're training as we're gathering experiences)\n",
    "\n",
    "Then, to specify the network itself, we'll use [tf.keras](https://www.tensorflow.org/guide/keras), a library built-in to TensorFlow which provides convenience functions. We'll set up a Keras [Sequential model](https://www.tensorflow.org/api_docs/python/tf/keras/models/Sequential) because we're specifying a simple network, which acts as a python callable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def dqn_network(input_shape, action_num):\n",
    "    return tf.keras.Sequential([\n",
    "        tf.keras.layers.Conv2D(16, 5, 2, input_shape=input_shape),\n",
    "        tf.keras.layers.Conv2D(32, 5, 2),\n",
    "        tf.keras.layers.Conv2D(32, 5, 2),\n",
    "        tf.keras.layers.Flatten(),\n",
    "        tf.keras.layers.Dense(action_num)\n",
    "    ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we need to write a function which controls the exploration process. We'll use a method called [epsilon-greedy exploration](http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf) where we explore (take a random action) with probability $\\epsilon$ and exploit (choose the greedy action from the policy network) with probability $1-\\epsilon$. We initially set $\\epsilon$ to a large value to explore a lot, and then decay it over training to a small value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "EPS_START = 1.\n",
    "EPS_END = 0.1\n",
    "EPS_DECAY = 100000 # number of over which to decay EPS, i.e., after n steps, EPS == EPS_END\n",
    "\n",
    "def select_eps_greedy_action(policy_model, obs, step, num_actions):\n",
    "    \"\"\"\n",
    "    Args:\n",
    "        policy_model (callable): mapping of `obs` to q-values\n",
    "        obs (np.array): current state observation\n",
    "        step (int): training step count\n",
    "        num_actions (int): number of actions available to the agent\n",
    "    \"\"\"\n",
    "    eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * step / EPS_DECAY)\n",
    "    \n",
    "    if random.random() > eps_threshold: # exploit\n",
    "        action = tf.argmax(policy_model(tf.convert_to_tensor(obs)), axis=1)\n",
    "    else: # explore\n",
    "        action = random.randrange(num_actions)\n",
    "    return action"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we'll use the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) on the TD-error to calculate gradients. It has quadratic behavior below $\\delta$ near 0, and linear behavior above $\\delta$. This allows the error to grow quickly as it moves away from zero, but not blow up too quickly.\n",
    "\n",
    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/c/cc/Huber_loss.svg/1920px-Huber_loss.svg.png\" width=\"50%\">\n",
    "\n",
    "We'll use [tf.where](https://www.tensorflow.org/api_docs/python/tf/where) to implement the piecewise function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def huber_loss(x, delta=1.0):\n",
    "    \"\"\"Reference: https://en.wikipedia.org/wiki/Huber_loss\"\"\"\n",
    "    return tf.where(\n",
    "        tf.abs(x) < delta,\n",
    "        tf.square(x) * 0.5,\n",
    "        delta * (tf.abs(x) - 0.5 * delta)\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function which calculates gradients is long, but not too complex. The first step is to sample a batch of transitions from the replay memory and rearrange them so they can be used easily. Then, we'll use a feature of Eager execution called the [GradientTape](https://www.tensorflow.org/tutorials/eager/automatic_differentiation), which allows us to calculate gradients without a fixed graph structure. Within that scope, we'll calculate the [TD-error](https://en.wikipedia.org/wiki/Temporal_difference_learning) as the difference between Q-value output by the policy network and the Q-value calculated with one-step lookahead, $Q(s_t,a) = r_t + \\gamma \\max_a Q(s_{t+1},a)$. We'll calculate the Huber loss, and then get the gradients of the policy network variables from [tf.GradientTape.gradient](https://www.tensorflow.org/api_docs/python/tf/GradientTape#gradient). Finally, gradients are clipped to help ensure stability of the Q-value estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def dqn_gradients(replay_memory, policy_model, target_model, batch_size, gamma=0.99, grad_norm_clipping=1.0):\n",
    "    # before enough transitions are collected to form a batch\n",
    "    if len(replay_memory) < batch_size:\n",
    "        return None, None\n",
    "\n",
    "    # prepare training batch\n",
    "    transitions = replay_memory.sample(batch_size)\n",
    "    batch = Transition(*zip(*transitions))\n",
    "    next_states = np.array(batch.next_state, dtype=np.float32)\n",
    "    state_batch = np.array(batch.state, dtype=np.float32)\n",
    "    action_batch = np.array(batch.action, dtype=np.int64)\n",
    "    reward_batch = np.array(batch.reward)\n",
    "\n",
    "    with tf.GradientTape() as tape:\n",
    "        # calculate value from taking action\n",
    "        action_idxs = np.stack([np.arange(batch_size, dtype=np.int32), action_batch], axis=1)\n",
    "        state_action_values = tf.gather_nd(policy_model(state_batch), action_idxs)\n",
    "        \n",
    "        # calculate best value at next state\n",
    "        next_state_values = tf.reduce_max(target_model(next_states), axis=1)\n",
    "\n",
    "        # compute the expected Q values\n",
    "        expected_state_action_values = (next_state_values * gamma) + reward_batch\n",
    "\n",
    "        # compute Huber loss on TD-error\n",
    "        td_error = state_action_values - expected_state_action_values\n",
    "        loss = huber_loss(td_error)\n",
    "        gradients = tape.gradient(loss, policy_model.trainable_variables)\n",
    "\n",
    "    # clip gradients\n",
    "    for i, grad in enumerate(gradients):\n",
    "        if grad is not None:\n",
    "            gradients[i] = tf.clip_by_norm(grad, grad_norm_clipping)\n",
    "    return loss, gradients"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we'll actually perform training. We setup the models and then, at each step, collect one transition with the current policy network and then train with one batch from the replay memory. At some interval we'll update the target network to the most recent parameters of the policy network with `tf.keras.model.Sequential`'s `get_weights` and `set_weights` functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "TARGET_UPDATE_STEP_FREQ = 10\n",
    "BATCH_SIZE = 64\n",
    "EPISODE_NUM = 20\n",
    "REPLAY_BUFFER_SIZE = 100000\n",
    "LEARNING_RATE = 1e-3\n",
    "\n",
    "# setup models, replay memory, and optimizer\n",
    "policy_model = dqn_network(OBS_SHAPE, NUM_ACTIONS)\n",
    "target_model = dqn_network(OBS_SHAPE, NUM_ACTIONS)\n",
    "replay_memory = ReplayMemory(REPLAY_BUFFER_SIZE)\n",
    "optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE)\n",
    "    \n",
    "step = 0\n",
    "for episode in range(EPISODE_NUM):\n",
    "    # initialize environment\n",
    "    prev_observation = env.reset()\n",
    "    observation, reward, done, _ = env.step(random.randrange(NUM_ACTIONS))\n",
    "    done = False\n",
    "    ep_score = 0.\n",
    "\n",
    "    while not done: # until the episode ends\n",
    "        # select and perform an action\n",
    "        prepped_obs = np.expand_dims(np.array(observation, dtype=np.float32), axis=0)\n",
    "        action = select_eps_greedy_action(policy_model, prepped_obs, step, NUM_ACTIONS)\n",
    "        observation, reward, done, _ = env.step(action)\n",
    "        # add to memory\n",
    "        replay_memory.push(prev_observation, action, observation, reward)\n",
    "        prev_observation = observation\n",
    "\n",
    "        # train model\n",
    "        loss, grads = dqn_gradients(replay_memory, policy_model, target_model, BATCH_SIZE)\n",
    "        if grads is not None:\n",
    "            optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))\n",
    "        # increment counters\n",
    "        ep_score += reward\n",
    "        step += 1\n",
    "\n",
    "    # update the target network, copying all variables in DQN\n",
    "    if episode % TARGET_UPDATE_STEP_FREQ == 0:\n",
    "        target_model.set_weights(policy_model.get_weights())\n",
    "    print(\"Episode {} achieved score {} at {} training steps\".format(episode, ep_score, step))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's basically how we implement [Mnih et al.'s Deep Q Network](https://www.nature.com/articles/nature14236)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## No Hackathon due this week.\n",
    "Enjoy life, or work on your project or something."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (rl)",
   "language": "python",
   "name": "rl"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
